The exponentially truncated q-distribution: A generalized distribution for real complex systems.



Hari M. Gupta, Jose R. Campanha.

IGCE - Departamento de Fisica, IGCE, UNESP
Caixa Postal 178, CEP 13500-970
Rio Claro - Sao Paulo - Brazil

July 8, 2008 



Abstract 



To know the statistical distribution of a variable is an important problem in the management of resources. Distributions of the power law type are observed in many real systems. However, power law distributions have an infinite variance and thus cannot be used as a standard distribution. Normally, professionals in the area use a normal distribution with variable parameters or some other approximate distribution such as Gumbel, Wakeby, or Pareto, which has limited validity.

Tsallis presented a microscopic theory of the power law in the framework of non-extensive thermodynamics, considering long-range interactions or long memory. In the present work, we consider the softening of long-range interactions or memory and present a generalized distribution which has finite variance and can be used as a standard distribution for all real complex systems with power law behaviour. We applied this distribution to a financial system, rain precipitation, and some geophysical and social systems. We found good agreement over the entire range in all cases for the probability density function (pdf) as well as the accumulated probability. This distribution shows the universal nature of size limiting in real systems.



I. Introduction 



To know the statistical distribution of a variable is an important problem 
in management. For example, the distribution of the variation of a share 
price is important for financial management, while the distribution of the 
water flux or water level in a river or rain precipitation is important for 
water and flood management. 

Recently, physicists have started to study natural systems as a whole rather than in parts [1-6] and are interested in the holistic properties of these systems, normally called "Complex Systems". These also include financial, social, biological, economic and geophysical systems, as they have the same characteristics.

Power law scaling [7,8] is observed in many such systems [2,9-29] and it 
is now considered an important property of these systems. In general, power 
law exists in the central part of the distribution. It deviates from power law 
for very small and very large steps. 

Long-range interactions and memory effects are present in all real systems, including social and economic systems [30,31], and are important for a statistical distribution. Tsallis, through non-extensive thermodynamics, gives a microscopic basis for the power law [32,33] considering long-range interactions and long memory effects. The distribution also explains the initial deviation from the power law.

A power law has infinite variance, which discourages a physical approach, and an unavoidable cut-off is always present. Mantegna and Stanley [34] introduced the truncated Levy flight, in which the probability of taking a step is abruptly cut to zero at a certain critical step size. Koponen [35] gradually truncated the probability distribution from the beginning. This violates the power law distribution in the central part. Gupta and Campanha [36-38] proposed the gradually truncated Levy flight, in which the probability distribution is cut off gradually only after a certain critical value. This distribution has an undesirable discontinuity at the critical length. Tsallis et al. [39] consider a cross-over behaviour to explain the deviation from the power law for extreme values. This distribution does not explain a sharp cut-off as observed in many cases [40]. Thus, in the absence of a standard distribution for these systems, in practice a normal distribution with variable parameters is used. When one is interested in extreme value distributions, some other distribution such as Wakeby, Gumbel, exponential, log-normal etc. [41] is used; these are valid only in a particular range and generally do not provide a physical basis.

A statistical distribution for these systems must:

(i) have finite variance,

(ii) be continuous,

(iii) follow a power law in the central part,

(iv) be able to explain all kinds of cut-off, from very sharp to very slow, for extreme values,

(v) have a physical basis for the truncation of the power law.

In the present paper, we propose a softening of long-range interactions or long memory as the size of the variable under consideration increases. This avoids infinite variance. We then present a generalised statistical distribution based on this concept. With the availability of computer programs to numerically integrate a function, this distribution can be used to calculate the probability density function and the accumulated probability in any range of the distribution with fixed parameters. In Section II, we present the model and the distribution. In Section III, we present a method to estimate the parameters of the distribution. In Section IV, we apply this statistics to many problems in diverse areas, and finally in Section V, we discuss the results.



II. The model and the statistics 

In 1988, Tsallis [32] presented non-extensive thermodynamics, in which he incorporated long-range interactions and long memory effects. He proposed a generalized definition of entropy ($S_q$):



$$S_q = C\,\frac{1 - \sum_{i=1}^{W} p_i^{q}}{q - 1}, \qquad \left(\sum_{i=1}^{W} p_i = 1\right) \tag{1}$$

where C is a positive constant, W is the total number of microscopic possibilities of the system, and q is an entropic index, which plays a central role and is related to long-range interactions and long memory effects in a network. This expression recovers the usual Boltzmann-Gibbs entropy ($-C \sum_{i=1}^{W} p_i \ln p_i$) in the limit $q \to 1$, i.e. for short-range interactions. In this case, the size frequency distribution function $N(x)$ is given through

$$\frac{dN(x)}{dx} = -\lambda N(x) \tag{2}$$

where $\lambda$ is a positive constant and $N(x)$ is the frequency probability of size x. This gives

$$N(x) = N_0 \exp(-\lambda x) \tag{3}$$

In general, the frequency density distribution function $N(x)$ is given through:

$$\frac{dN(x)}{dx} = -\lambda N^{q}(x) \tag{4}$$

hence

$$N(x) = \frac{N_0}{\left[1 + (q - 1)\lambda x\right]^{1/(q-1)}} \tag{5}$$

where $N_0$ is a normalization constant. This expression recovers the usual Boltzmann distribution in the limit $q \to 1$, i.e. for short-range interactions, as shown in Equation (3). For $q > 1$, this expression gives a power law for relatively large values of the step x.
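The two statements above (recovery of the exponential form as q approaches 1, and power-law behaviour for large x when q > 1) can be checked numerically. The following is a minimal sketch, not part of the original work; the function name `q_exponential` and the parameter values are assumptions chosen only for illustration.

```python
import math

def q_exponential(x, n0, lam, q):
    """Eq. (5): N(x) = N0 / [1 + (q - 1) * lam * x] ** (1 / (q - 1)).

    Falls back to Eq. (3), N0 * exp(-lam * x), when q is numerically 1.
    """
    if abs(q - 1.0) < 1e-12:
        return n0 * math.exp(-lam * x)
    return n0 * (1.0 + (q - 1.0) * lam * x) ** (-1.0 / (q - 1.0))

# Hypothetical values, not taken from the paper.
n0, lam = 1.0, 0.5

# As q -> 1 the q-exponential approaches the Boltzmann form of Eq. (3).
near_boltzmann = q_exponential(2.0, n0, lam, 1.0 + 1e-8)
boltzmann = q_exponential(2.0, n0, lam, 1.0)

# For q > 1 and large x, the local log-log slope tends to -1 / (q - 1).
q = 1.5
x1, x2 = 1e6, 1e7
slope = (math.log(q_exponential(x2, n0, lam, q))
         - math.log(q_exponential(x1, n0, lam, q))) / (math.log(x2) - math.log(x1))

print(near_boltzmann, boltzmann, slope)
```

With q = 1.5 the exponent 1/(q - 1) equals 2, so the printed slope should be close to -2.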

The power law distribution can not continue forever in real systems. It 
has to be truncated in some way to avoid infinite variance and have a finite 
size. 

To account for the departure from the power law at long range, Tsallis et al. [39] assume a crossover to another type of behavior and modify Equation (4) as

$$\frac{dN(x)}{dx} = -\mu_r N^{r}(x) - (\lambda - \mu_r) N^{q}(x) \tag{6}$$

where $\mu_r$ is very small compared to $\lambda$. This gives a crossover between two different power laws (characterized by q and r, respectively) or from a power law to a normal distribution within a nonextensive scenario.



Although cross-over behavior as suggested by Tsallis can avoid an infinite variance, in the present work we look for another possibility, i.e., the truncation of the power law due to the softening of long-range interactions or memory, which gives a finite size in real systems. This is not a cross-over behavior. We consider that the entropic index q decreases with step size (x) due to the softening of long-range interactions or memory effects, which arises because of the physical limitations of the components or of the system itself. Thus q depends on the step size. This is similar to the way anharmonic terms become important for calculating the potential energy in lattice vibrations.

The size limiting factor is of very small importance for small steps, while it is necessary for larger steps. The entropic index q is equal to 1 in the absence of long memory or long-range interactions. Thus the information about these interactions is given through $(q - 1)$. We consider that this factor approaches zero for very large values of x. In general, for this:

$$q(x) - 1 = \frac{q_0 - 1}{\sum_{i} \theta_i x^{i}} \tag{7}$$

where $q_0$ and $q(x)$ are the values of the entropic index q for step size zero and step size x respectively. $\theta_i$ and i are adjustable parameters depending on the softening of long-range interactions or memory.

To simplify, we propose an exponential decay, i.e.

$$q(x) - 1 = (q_0 - 1)\exp\left(-(\theta x)^{i}\right) \tag{8}$$

where $\theta$ and i set the rate at which the importance of these interactions decreases as the step size x increases. A higher value of i indicates a sharper cut-off; $\theta$ is a scaling factor for the cut-off.
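To illustrate how the exponent i controls the sharpness of the decay in Equation (8), the sketch below tabulates q(x) for several values of i. The function name `entropic_index` and all numerical values are assumptions made for this illustration only.

```python
import math

def entropic_index(x, q0, theta, i):
    """Eq. (8): q(x) = 1 + (q0 - 1) * exp(-(theta * x) ** i), for x >= 0."""
    return 1.0 + (q0 - 1.0) * math.exp(-((theta * x) ** i))

# Hypothetical values: the cut-off scale is 1 / theta = 100.
q0, theta = 1.5, 0.01
for i in (0.5, 1, 2, 5):
    row = [round(entropic_index(x, q0, theta, i), 4) for x in (0, 50, 100, 200, 400)]
    print(f"i = {i}: {row}")
```

In this sketch all curves coincide at x = 1/θ. A larger i keeps q(x) closer to q0 below that scale and drives it to 1 more quickly above it, which is the sense in which higher i gives a sharper cut-off.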

For very large values of x, $q(x)$ approaches 1 and thus gives a normal distribution as required through the central limit theorem. In the present model the distribution function is given through:

$$N(x) = N_0\left[1 + (q_0 - 1)\lambda x \exp\left(-(\theta x)^{i}\right)\right]^{-\exp\left((\theta x)^{i}\right)/(q_0 - 1)} \tag{9}$$

For simplification we replace $(q_0 - 1)\lambda$ by another constant $\beta$, and $1/(q_0 - 1)$ by the power exponent $\alpha$. Finally the frequency density distribution is given by

$$N(x) = N_0\left[1 + \beta x \exp\left(-(\theta x)^{i}\right)\right]^{-\alpha \exp\left((\theta x)^{i}\right)} \tag{10}$$
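As a minimal sketch (not the authors' code), Equation (10) can be evaluated in Python. The function name `log_truncated_q`, the far-tail threshold of 50 and the values of θ and N0 below are assumptions for illustration; α = 2.0 and β = 0.0025 are the values legible in the Figure 1 caption. Working with the logarithm avoids overflow in the exponential of (θx)^i for large steps.

```python
import math

def log_truncated_q(x, n0, alpha, beta, theta, i):
    """Natural log of Eq. (10) for a step x >= 0."""
    # theta == 0 means no truncation; handled explicitly because 0.0 ** 0 is 1.0 in Python.
    u = 0.0 if theta == 0.0 else (theta * x) ** i
    if u > 50.0:
        # Assumed shortcut: exp(u) * log1p(beta*x*exp(-u)) -> beta*x when beta*x*exp(-u) is tiny.
        return math.log(n0) - alpha * beta * x
    return math.log(n0) - alpha * math.exp(u) * math.log1p(beta * x * math.exp(-u))

# alpha and beta as in the Figure 1 caption; n0 and theta are hypothetical.
n0, alpha, beta, theta = 1.0, 2.0, 0.0025, 1e-4
for i in (0.5, 1, 2, 5):
    values = [log_truncated_q(x, n0, alpha, beta, theta, i) for x in (1e2, 1e3, 1e4, 1e5)]
    print(i, [round(v, 3) for v in values])
```

For steps well below 1/θ the printed values barely depend on i, while for steps well above it they approach -αβx, the logarithm of an exponential tail.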



In Figure 1 we compare N(x) vs. x in the present approach, through Equation (10), for very large steps. Under the present model, the gradual truncation of the power law can be adjusted from very sharp to very slow through the value of i without interfering with the power law behavior in the central part of the distribution. With cross-over behaviour we cannot explain a very sharp cut-off without violating the power law in the central part [40].

Figure 1 - Theoretical distribution of the present model in log-log scale. We consider: $N_0 = 1 \times 10^{[\ldots]}$; $\beta = 0.0025$; $\alpha = 2.0$. Curves A, B, C, D, E, and F are obtained with $i = 1/2$ and $\theta = 3 \times 10^{[\ldots]}$ (Curve A), $i = 1$ and $\theta = 3 \times 10^{[\ldots]}$ (Curve B), $i = 2$ and $\theta = 1 \times 10^{[\ldots]}$ (Curve C), $i = 3$ and $\theta = 1.5 \times 10^{[\ldots]}$ (Curve D), $i = 4$ and $\theta = 1.8 \times 10^{[\ldots]}$ (Curve E) and $i = 5$ and $\theta = 2 \times 10^{[\ldots]}$ (Curve F).



In terms of the probability density distribution ($p(x)$), the distribution is:

$$p(x) = k\left[1 + \beta x \exp\left(-(\theta x)^{i}\right)\right]^{-\alpha \exp\left((\theta x)^{i}\right)} \tag{11}$$

where k is another normalization constant, equal to $N_0/N_{total}$, where $N_{total}$ is the total number of observations.
Further

$$\int p(x)\,dx = 1 \tag{12}$$



Many times, the maximum frequency is not at x = 0. In this case we need to shift the origin to $x_m$, where $x_m$ is the most frequent value of the variable x. At $x_m$ the frequency is maximum; however, it is not necessarily the mean value of the variable. In this case the frequency density distribution is given through:

$$N(x) = N_0\left[1 + \beta |x - x_m| \exp\left(-(\theta |x - x_m|)^{i}\right)\right]^{-\alpha \exp\left((\theta |x - x_m|)^{i}\right)} \tag{13}$$

The physical mechanisms behind the distribution for $x > x_m$ and $x < x_m$ may be different and thus the parameters of the distribution may also be different. Thus the two cases must be treated separately.

For $\theta = 0$, the present distribution reduces to the Tsallis distribution [33,39], as expected.
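The two limits of Equation (10) can be checked directly. The short derivation below is a sketch added for illustration; it assumes $i > 0$ and uses the shorthand $\varepsilon = \exp(-(\theta x)^{i})$, which is not a symbol from the original text.

```latex
% Limit 1: no truncation (theta = 0, i > 0), so exp(-(theta x)^i) = 1
N(x) = N_0 \left[ 1 + \beta x \right]^{-\alpha}
     = \frac{N_0}{\left[ 1 + (q_0 - 1)\lambda x \right]^{1/(q_0 - 1)}}
% which is Eq. (5) with q = q_0, using beta = (q_0 - 1) lambda and alpha = 1/(q_0 - 1).

% Limit 2: strong truncation, \varepsilon = \exp(-(\theta x)^{i}) with \beta x \varepsilon \ll 1
\ln \frac{N(x)}{N_0} = -\frac{\alpha}{\varepsilon} \ln\left( 1 + \beta x \varepsilon \right)
  \approx -\frac{\alpha}{\varepsilon}\, \beta x \varepsilon = -\alpha \beta x = -\lambda x
% since alpha * beta = lambda; the far tail is exponential, as in Eq. (3).
```

The second limit shows why the variance is finite: beyond the cut-off scale the tail decays at least as fast as the exponential of Equation (3).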

The accumulated probability between a and b is given through

$$P(a < x < b) = \int_{a}^{b} p(x)\,dx \tag{14}$$
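Since Equations (12) and (14) have no closed form in general, they are evaluated numerically. The following is a minimal sketch using a plain trapezoidal rule from the standard library; the helper names `shape`, `integrate` and `accumulated`, the parameter values and the finite upper limit are all assumptions made for illustration, and a library integrator could be used instead.

```python
import math

def shape(x, alpha, beta, theta, i):
    """Eq. (11) without the constant k, for x >= 0 (theta > 0 assumed)."""
    u = (theta * x) ** i
    if u > 50.0:  # assumed shortcut for the far tail, where the shape is ~exp(-alpha*beta*x)
        return math.exp(-alpha * beta * x)
    return math.exp(-alpha * math.exp(u) * math.log1p(beta * x * math.exp(-u)))

def integrate(f, a, b, n=20000):
    """Composite trapezoidal rule with n equal panels."""
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + j * h) for j in range(1, n)))

# Hypothetical parameters; 'upper' stands in for infinity because the tail is negligible there.
alpha, beta, theta, i = 2.0, 0.5, 0.05, 2.0
upper = 200.0
f = lambda x: shape(x, alpha, beta, theta, i)

k = 1.0 / integrate(f, 0.0, upper)  # Eq. (12) fixes the normalization constant

def accumulated(a, b):
    """Eq. (14): P(a < x < b) for 0 <= a < b <= upper."""
    return k * integrate(f, a, b)

print(accumulated(0.0, 10.0), accumulated(10.0, upper))
```

The two printed probabilities should add up to approximately 1, which is a quick consistency check on the chosen upper limit and step count.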



The value of $p(x)$ at $x = 0$ must be carefully studied, particularly when $x_m = 0$. In many cases it may be a discrete number and include many cases apart from the mechanism under discussion. For example, in a financial market, zero variation in the price of a particular share also includes cases in which it is not traded at all, along with cases in which it is traded but with zero variation. Thus in these cases $p(0)$ should be estimated separately through

$$p(0) = \frac{\text{Events } x = 0}{\text{Total events}} \tag{15}$$

and thus

$$\int_{-\infty}^{\infty} p(x)\,dx = p(0) + \int_{-\infty}^{0^-} p(x)\,dx + \int_{0^+}^{\infty} p(x)\,dx = 1 \tag{16}$$

where $0^-$ and $0^+$ are respectively the values of x just below and just above zero. We define $N(x)$ as the number of events in the range $(x - \frac{\Delta x}{2}) < x < (x + \frac{\Delta x}{2})$ divided by $\Delta x$, and thus eliminate $p(0)$ when $x > \frac{\Delta x}{2}$.
There is very little mass for extreme values of x, and thus it is difficult to compare a theoretical distribution with the available data. The technique known as a Zipf plot [38] is very important in this case. Suppose we order our N observations from largest to smallest, so that the index i is the rank of $x_i$. Then

$$i = N \int_{x_i}^{\infty} p(x)\,dx \tag{17}$$

Thus the rank is simply a transformation of the cumulative distribution function. The empirical accumulated probability above $x_i$ is its rank i divided by the total number of observations. The accumulated probability accentuates the upper tail of the distribution and therefore makes it easier to detect deviations in the extreme tails from the theoretical predictions of a particular distribution.
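The rank-based estimate described above takes only a few lines to compute. This is a minimal sketch; the function name `empirical_exceedance` and the sample values are hypothetical.

```python
def empirical_exceedance(observations):
    """Return (x_i, i / N) pairs, where i is the rank of x_i counted from the largest."""
    ordered = sorted(observations, reverse=True)
    n = len(ordered)
    return [(x, rank / n) for rank, x in enumerate(ordered, start=1)]

sample = [2.0, 7.5, 0.4, 3.1, 1.2]  # made-up observations
pairs = empirical_exceedance(sample)
for x, prob in pairs:
    print(x, prob)
```

Plotting the second column against the first on a semi log or log-log scale gives the empirical curve that is compared with the model's accumulated probability. With this convention the largest observation is assigned probability 1/N and the smallest is assigned 1.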



III. Estimation of parameters. 



We use the following steps to estimate the parameters of the distribution. Let $(x_1, x_2, \ldots, x_N)$ be the set of N observations of a random variable x for which the probability density function is $p(x)$. We select a proper bin size $\Delta x$ and make a frequency table $f(x)$ vs x, where $f(x)$ gives the number of observations between $(x - \frac{\Delta x}{2})$ and $(x + \frac{\Delta x}{2})$. From this table we obtain the value of $x_m$, i.e. the value of x for which the frequency has its maximum value ($N_0$). We separate the observations into two groups, one for $x > x_m$ and the other for $x < x_m$. Each group may have different parameter values because of the different mechanisms in the two cases.

The frequency density function $N(x)$ is given by:

$$N(x) = \frac{f(x)}{\Delta x} \tag{18}$$



In the case of extreme values of x, we have very little mass, i.e. very few observations, and thus we have many zeros in the $f(x)$ vs x table for the extreme values because of the limited and random nature of the observations. Thus, to give the observations physical significance, we increase the value of the interval $\Delta x$ for the extreme values to avoid zero values of $N(x)$.

For small steps, the cut-off parameters are of negligible importance. We therefore put $\theta = 0$ and $i = 0$ for the initial 50% of steps and estimate $\alpha$ and $\beta$. $N_0$ and $x_m$ are estimated through the frequency tables. Knowing these parameters, we estimate $\theta$ and i for the best fit over the entire curve. Sometimes it may be necessary to re-estimate $\alpha$ at this stage.
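The first steps of this procedure (frequency table, location of the mode, and a first estimate of α and β with the truncation switched off) can be sketched as follows. The text does not specify a fitting method, so the coarse grid search on log N used here is an assumption, as are the function names and the synthetic data.

```python
import math
from collections import Counter

def frequency_density(observations, dx):
    """Eq. (18): map each bin centre x to N(x) = f(x) / dx."""
    counts = Counter(math.floor(v / dx + 0.5) for v in observations)
    return {idx * dx: c / dx for idx, c in sorted(counts.items())}

def mode_of(density):
    """Return (x_m, N0): the bin centre with the largest frequency density."""
    x_m = max(density, key=density.get)
    return x_m, density[x_m]

def fit_alpha_beta(points, n0, alphas, betas):
    """Coarse grid search with theta = 0: minimise the squared error in log N.

    points holds (|x - x_m|, N) pairs taken from the small-step half of one side.
    """
    best = None
    for a in alphas:
        for b in betas:
            err = sum((math.log(n) - math.log(n0) + a * math.log1p(b * d)) ** 2
                      for d, n in points)
            if best is None or err < best[0]:
                best = (err, a, b)
    return best[1], best[2]

# Synthetic check with made-up values: data generated from alpha = 2, beta = 0.5.
synthetic = [(d, 100.0 * (1.0 + 0.5 * d) ** -2.0) for d in range(10)]
print(fit_alpha_beta(synthetic, 100.0, [1.0, 1.5, 2.0, 2.5], [0.25, 0.5, 1.0]))
```

With α and β fixed in this way, θ and i would then be adjusted against the whole curve, for instance by a second search over the full set of bins on the same side of the mode.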



IV. Applications. 

Now we apply this model to describe the distribution of a parameter in some geophysical complex systems of interest.

(A) Water level of a river: 

For water, flood and agriculture management, it is extremely important to know the distribution of the water level in a river of interest. It is therefore regularly recorded by the water management department. In the present case we took the water level in the Parana River, one of the important rivers in Brazil, at the Sao Paulo-Parana border. The water level is measured daily by the Agencia Nacional de Aguas and can be obtained at the site www.ana.gov.br. We analyze the water level in the period from the 1st of January 1964 to the 30th of June 2005, in total 15,067 observations. This river receives water from many sources and the water level depends on rainfall at different places at different times, and thus presents a complex system with long-term memory and long-range interactions.

Through the frequency distribution of the empirical data, we observe a maximum frequency density $N_0 = 10{,}032$ days/m at the height of 2.87 m ($x_m$). For $x > x_m$, we found $\alpha = 1.66$, $\beta = 1.25$, $\theta = 0.185$ and $i = 2.85$ for the best fit. For $x < x_m$, we found $\alpha = 3.2$, $\beta = 0.54$, $\theta = 0.5$ and $i = 9.0$.
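With these fitted values, the two-sided form of Equation (13) can be evaluated at any water level. The sketch below uses the parameters quoted above; the function names and the far-tail shortcut are assumptions, and the printed numbers are model values only, not a reproduction of the empirical data.

```python
import math

def log_density(x, x_m, n0, alpha, beta, theta, i):
    """Natural log of Eq. (13) on one side of the mode x_m."""
    d = abs(x - x_m)
    u = 0.0 if theta == 0.0 else (theta * d) ** i
    if u > 50.0:  # assumed far-tail shortcut: exp(u)*log1p(beta*d*exp(-u)) -> beta*d
        return math.log(n0) - alpha * beta * d
    return math.log(n0) - alpha * math.exp(u) * math.log1p(beta * d * math.exp(-u))

X_M, N0 = 2.87, 10032.0                                   # mode (m), peak density (days/m)
RIGHT = dict(alpha=1.66, beta=1.25, theta=0.185, i=2.85)  # fit for x > x_m
LEFT = dict(alpha=3.2, beta=0.54, theta=0.5, i=9.0)       # fit for x < x_m

def water_level_density(x):
    """Frequency density (days/m) at water level x (m), using the side-specific fit."""
    side = RIGHT if x >= X_M else LEFT
    return math.exp(log_density(x, X_M, N0, **side))

for level in (1.0, 2.0, 2.87, 4.0, 6.0, 8.0):
    print(level, water_level_density(level))
```

Treating the two sides with separate parameter sets follows the recommendation in Section II that the cases $x > x_m$ and $x < x_m$ be handled separately.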

In Figure 2a we compare the log(frequency) vs water level (a) distribution. By plotting log(frequency) we can compare cases of even very small frequency at extremely high water levels. In Figure 2b, we compare the accumulated probability above water level a, P(x > a), vs a. The agreement is good for the entire curve, including extreme cases, up to four orders of magnitude in the accumulated probability as well as in the frequency density distribution.

Figure 2a - Frequency density vs water level (m) in semi log scale. The continuous line is through the present model. The dotted points are empirical.

Figure 2b - Accumulated probability density above water level a (P(x > a)) vs water level (a) in semi log scale. The dotted points are empirical, while the continuous line is through the present model.



(B) Water flux in a river: 

Another problem of interest in water management is the distribution of water flux in a river at a particular point of interest. We have 5,428 observations in the period from the 23rd of October 1969 to the 31st of August 1984 in the Parana River at the Sao Paulo-Parana border. Through the frequency distribution of the observed data, we find $N_0 = 24.4$ days s/m$^3$ at $x_m = 155$ m$^3$/s. For $x > x_m$, we found $\alpha = 0.84$, $\beta = 0.013$, $\theta = 0.00177$ and $i = 3.0$. For $x < x_m$, we found $\alpha = 0.[\ldots]$, $\beta = 0.045$. Due to the small number of observations on this side, we did not observe any truncation of q-values and thus consider $\theta = 0$ and $i = 0$. In Figure 3a we compare log(frequency) vs water flux, while in Figure 3b we compare the accumulated probability P(x > a) above water flux (a) vs water flux (a). Again the agreement is good throughout the curve, including extreme values.

Figure 3a - Frequency density vs water flux (m$^3$/s) in semi log scale.

Figure 3b - Accumulated probability density for water flux above a (P(x > a)) vs water flux a in semi log scale. The dotted points are empirical, while the continuous line is through the present model.



(C) Rain precipitation 

To compare the distribution of rain precipitation, an important problem for agriculture and water control, we took a time series of daily rain precipitation at Campinas city in Sao Paulo State, Brazil, in the period of 1950 to 1980 at station prefix D4-044. These data were obtained from the Departamento de Aguas e Energia Eletrica of Sao Paulo State and are available at http://www.sigrh.sp.gov.br. In total we have 21,549 observations.

In the case of rain precipitation, the probability of zero precipitation is a singular point, as it also includes days when there is no rain at all. We consider only days when rain precipitation is more than zero. We have $N_0 = 598$ days/mm for $x_m = 1.0$ mm. The values of the parameters are $\alpha = 1.7$, $\beta = 0.074$, $\theta = 0.08$ and $i = 0.12$. The number of days when there is no rain precipitation is 15,558, thereby giving $p(0) = 0.722$.
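The value of p(0) follows directly from Equation (15) and the counts quoted in the text. A short check, with variable names chosen here for illustration:

```python
total_days = 21549   # observations reported in the text
dry_days = 15558     # days with zero precipitation reported in the text

p_zero = dry_days / total_days      # Eq. (15)
wet_days = total_days - dry_days    # days that enter the fitted distribution

print(round(p_zero, 3), wet_days)
```

The remaining probability, 1 - p(0), is what the continuous part of the distribution has to integrate to, in line with Equation (16).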

In Figure 4a, we plot log(frequency) vs rain precipitation. In Figure 4b we plot the accumulated probability P(x > a) for rain precipitation above a versus rain precipitation a. Again the agreement is good.

Figure 4a - Frequency density vs rain precipitation in mm.

Figure 4b - Accumulated probability density for rain precipitation above a (P(x > a)) vs rain precipitation a in semi log scale. The dotted points are empirical, while the continuous line is through the present model.

(D) Financial Systems: Variation of an economic index.

It has been shown recently that the variation of a share price in the high frequency limit, i.e. variation per minute, is given through a power law [36]. However, for extreme values the variation predicted by the power law is much more than what is observed, and it must be truncated in some way or other. In the present case we took the variation of the share price of Banco do Brasil, the biggest semi-government bank of Brazil. We consider variation per minute, i.e. in the high frequency range. The period is from the 1st of July 2004 to the 30th of June 2007, a total of 329,489 observations furnished by IBOVESPA - Sao Paulo. In this case the probability of zero variation is a singular point, as it also includes all those minutes when the share is not traded at all along with minutes when the share is traded but with zero variation.

For positive variation we have frequency density $N_0 = 29{,}640$ at $x_m = 0.5 \times 10^{[\ldots]}$ %. The values of the parameters for this side are $\alpha = 5.5$, $\beta = 0.197$. We did not observe the effect of gradual truncation in this period and so we put $\theta = 0$ and $i = 0$. For the negative side we found $N_0 = 29{,}640$ at $x_m = -0.5 \times 10^{[\ldots]}$ %. The values of the parameters for this side are $\alpha = 4.0$, $\beta = 0.3325$, $\theta = 1 \times 10^{[\ldots]}$ and $i = 0.12$. This means that for this side the effect of the truncation, although small, is still necessary. In Figure 5a, we show log(frequency) vs percentage variation in units of $10^{[\ldots]}$. The agreement is good. In Figure 5b we only compare the accumulated probability for the 10,000 highest variations in frequency density, as they are the most important. Again we found a good agreement.

Figure 5a - Frequency density vs. percentage variation ($\times 10^{[\ldots]}$) in share price/min.

Figure 5b - Accumulated probability density for share price variation above a (P(x > a)) vs. share price variation for extreme values in semi log scale. The dotted points are empirical, while the continuous line is through the present model.



(E) Distribution of the sun spots. 

The number of sun spots per month is a very old index and is available from 1611. This index measures the magnetic activity of the sun. Sun spot number data can be obtained from the National Geophysical Data Center in Boulder, Colorado, and are available at the site http://www.ngdc.noaa.gov



In Figure 6a we show the distribution of monthly sun spots from 1749 to 2007 and compare it with the present model with parameters $N_0 = 58.4$ months/sun spot for $x_m = 2.5$, $\alpha = 0.8$, $\beta = 0.04$, $\theta = 0.0075$ and $i = 1.9$. In Figure 6b we compare the accumulated probability for x > a versus a. The agreement is good in both cases.

Figure 6a - Frequency density vs. sun spots/month.

Figure 6b - Accumulated probability density for sun spots/month more than a (P(x > a)) vs. sun spots/month (a) in semi log scale. The dotted points are empirical, while the continuous line is through the present model.

V. Discussion 



In the present work, we presented a statistical distribution considering that the entropic index factor $(q - 1)$, which gives information about long-range interactions and/or memory effects, decreases exponentially with the size of the variable. This distribution automatically gives a power law in the central part and deviates from it for very small and very large values of the variable, as is normally observed in most complex systems. It gives finite variance as required through the central limit theorem. In the present work we applied this model to various geophysical and financial systems, and found good agreement in all cases. We also tried this model in some other cases, such as the citation index of scientists and the distribution of marks in an entrance examination [40], and the citation index of scientific publications [42], where we also obtain good agreement over eight orders of magnitude. In certain cases, due to limited observations, we could not estimate the values of the gradual truncation parameters and thus consider them equal to zero.



This distribution presents a statistics for complex systems which is valid over the entire range and can be used by geophysical and financial professionals. Thus it eliminates the need to use a distribution with variable parameters or an approximate distribution which is valid only in a limited range. It has a strong physical basis, and we have shown good agreement up to four orders of magnitude or more. Thus it provides a reliable standard distribution for these systems. This model also shows the universal nature of the truncation process in the distribution of a parameter of a complex system obeying a power law.